RFdiffusion Exercise

RFdiffusion binder design

We will start with the ‘manual’ way, running each of the steps in the RFdiffusion -> ProteinMPNN -> Alphafold2 initial guess workflow individually.

Then, we will run the nf-binder-design workflow that combines these steps into a more streamlined pipeline to better suit ‘production’ use on high-performance computing.

The target

(press the spanner icon to see the sequence, )

Here’s version of the PDL1 domain we cropped in the previous exercise: PDL1.pdb

On your server, let’s create a directory for our RFDiffusion work and upload this PDB file to input/PDL1.pdb:

# We should already have a user directory
#mkdir -p /scratch2/pd27/users/${USER}
#cd /scratch2/pd27/users/${USER}

cd /scratch2/pd27/users/${USER}
mkdir -p exercises/rfd/input
cd exercises/rfd

# check your working directory
pwd

# see what files and directories you have here
tree -L 2

If you have the trimmed PDL1.pdb file on your computer, upload it to /scratch2/pd27/users/${USER}/exercises/rfd/input/PDL1.pdb by dragging and dropping it onto the file explorer in vscode.

If you’d like to use a pre-prepared PDL1.pdb rather than your own, in the Termainl run:

wget -O input/PDL1.pdb https://australian-protein-design-initiative.github.io/binder-design-workshop/exercises/rfd/input/PDL1.pdb

Running RFdiffusion binder design

RFdiffusion is a general tool for generating protein structures via diffusion-based modeling - not only de novo binder design.

Here, we are going to run RFdiffusion with parameters specific for generating a small de novo binder chains, hopefully with good shape complementarity to our target and near our hotspots.

Let’s start by running the command, and while things are running we can break down what each part does:

# Load the RFdiffusion module that is pre-installed on the M3 HPC cluster
# This puts the RFdiffusion script `run_inference.py` on your PATH
#
module load rfdiffusion/.b44206a

mkdir -p output/rfdiffusion

run_inference.py \
  inference.input_pdb=input/PDL1.pdb \
  'contigmap.contigs=[A18-132/0 65-120]' \
  'ppi.hotspot_res=[A56]' \
  inference.output_prefix=output/rfdiffusion/pdl1_test \
  inference.num_designs=4 \
  denoiser.noise_scale_ca=0 \
  denoiser.noise_scale_frame=0

On a T4 GPU, this takes about ~8 minutes (~2 minutes per design).

Note that this version of the workshop is tailored for the M3 HPC cluster where we’ve pre-installed each tool as a module. If you are running this on a different system, you will need to adapt these commands to use the Apptainer containers less transparently - see Appendix: Running the Apptainer containers on any system.

Watching GPU utilization with nvitop

module load nvitop
# (nvitop is also pip-installable)

nvitop

# a more bare-bones version that is usually always installed when an NVIDIA GPU is present is
# nvidia-smi -l 10

Open a second terminal on your compute node where RFdiffusion is running and watch the GPU utilization with nvitop.

RFdiffusion options

  • inference.input_pdb: our target PDB file- this should contain (possibly cropped) target coordinates

  • contigmap.contigs: define the regions of the target we want to include (A18-132/0), and a length range for the new chain to generate (65-120)

  • ppi.hotspot_res: our hotspot residues

  • inference.output_prefix: the prefix for the output files (can be {directory}/{filename} prefix)

  • inference.num_designs: the number of designs (trajectories) to generate - we generate just a small number here - normally this might be 1000 or more

  • denoiser.noise_scale_ca and denoiser.noise_scale_frame: the noise scale for the translations and rotations - set to zero, since this is reported to improve the quality of the models as the expense of diversity (0.5 might also be a reasonable value)

More on the contig syntax

The contig syntax is a way of specifying existing residues in the target to include, and new residues / chains to add by in-painting.

A18-132 says ‘include the existing chain A, residues 18-132’ - we could exclude an N-terminal region like A27-132 - this would be equivalent to deleting those ATOM records.

The /0 at the end of A18-132/0 specifies a chain break. This is important - if you exclude it, the new generated residues will be fused to the C-terminal end of your target !

If we had a second chain B in the target, we might have something like: A18-132/0 B33-148/0 65-120

If we had a missing loop in our target spanning residues 73-83, we would need: A18-72/A84-132/0 65-120 (RFdiffusion will complain with an error if you include residues in a contig that don’t exist)

If a segment does not have a chain ID specfied, like 65-120, this is treated as a new chain to hallicinate, with a lower and upper length range.

CautionChallenge - defining contigs

If we used the full original 3BIK.pdb PDB file instead of manually trimming our coordinates in ChimeraX/Pymol/Vim, what contig expression could we use to ‘virtually’ trim to our domain of interest ?

How would we generate binders to our PDL1 domain that are exactly 100 residues long ?

You can see the full list of configuration options for RFdiffusion with:

run_inference.py --help

… most should usually be left as the defaults.

Some, like inference.ckpt_override_path are automatically set to select the correct model weights based on other config options in use.

For binder design, manually setting inference.ckpt_override_path=/models/rfdiffusion/Complex_beta_ckpt.pt can be useful to increase the beta-strand content of designs (this model is reportedly less well tested - YMMV !).

/models/rfdiffusion/ corresponds to the path where your RFdiffusion model weights were downloaded to - /models/rfdiffusion/ is a valid path in the context of the containers (and M3 module) we are using here but may be different for other installations of RFdiffusion.

Viewing the results

TODO:

ProteinMPNN inverse folding

Given backbone coordinates of a protein (or protein complex), ProteinMPNN generates a sequence with sidechain conformations compatible with the fold and interface with the target.

# First, make sure we are in the right directory
cd /scratch2/pd27/users/${USER}/exercises/rfd
pwd

# Load the dl_binder_design (ProteinMPNN) module that is pre-installed on the M3 HPC cluster
# This puts the script `dl_interface_design.py` (from https://github.com/nrbennet/dl_binder_design/blob/main/mpnn_fr/dl_interface_design.py) on your PATH
#
module load dl_binder_design/.cafa385

mkdir -p output/proteinmpnn

dl_interface_design.py \
    -pdbdir output/rfdiffusion/ \
    -relax_cycles 0 \
    -seqs_per_struct 2 \
    -outpdbdir output/proteinmpnn/ \
    -omit_AAs C

Here we randomly sample two sequences for the same backbone structure (-seqs_per_struct 2). You may choose to generate more (five ?) or just one.

We -omit_AAs C to avoid cysteine residues in the generated sequences - this prevents solubility and folding issues sometimes often with disulfide bonds or unintended disulfide mediated dimers during downstream expression and purification.

The dl_binder_design uses ‘ProteinMPNN-FastRelax’, where the -relax_cycles option specifies to run cycles of sequence prediction and relaxtion using Rosetta FastRelax to improve the compatibility of the sequence generated. -relax_cycles > 0 doesn’t work with -seqs_per_struct > 1 - if you want to generate more than one sequence AND use -relax_cycles 3 (or 5 ?), you’ll need to run dl_interface_design.py several times.

Other useful options:

  • -checkpoint_path
  • -temperature
  • -augment_eps

ProteinMPNN detects the GPU on our compute node, but it isn’t appreciably faster on GPU - if running in a separate job, no need to waste a GPU for this.

You can look at one of the output PDB files in output/proteinmpnn/ using the vscode Protein Viewer - click on the backbone near the hotspot and you’ll see sidechains are modeled.

Alphafold2 initial guess - scoring by prediction

Once sequences have been generated for the backbone design via inverse folding, we want to predict more accurately if this binder sequence is likely to fold into the correct structure.

The confidence metrics generated by Alphafold2 (initial guess) serve as the primary scores for ranking and selecting designs most likely to bind.

# Load the dl_binder_design (Alphafold2 initial guess) module that is pre-installed on the M3 HPC cluster
# This puts the script `predict.py`(from https://github.com/nrbennet/dl_binder_design/blob/main/af2_initial_guess/predict.py) on your PATH
#
module load dl_binder_design/.cafa385

mkdir -p output/af2_initial_guess/pdbs

predict.py \
    -pdbdir output/proteinmpnn \
    -outpdbdir output/af2_initial_guess/pdbs/ \
    -recycle 3 \
    -scorefilename output/af2_initial_guess/pdl1_test.scores.cs

On a T4 GPU, each AF2 initial guess structure for these complexes takes ~1.3 mins each (~10 min total), on average.

Look at output/af2_initial_guess/pdl1_test.scores.cs - this file contains a range of scores to assess the quality of this ‘quick and dirty’ structure prediction. In particular, we are interested in designs with scores:

  • pae_interaction < 10

and

  • plddt_binder > 80

nf-binder-design : putting it all together

To do a more realistic de novo binder design campaign, you’ll need to generate several 1000s of RFdiffusion trajectories, at least one (probably 2 or 3) proposed sequences for each backbone design with ProteinMPNN, and then run Alphafold2 initial guess on each of these to generate scores and putative complex structures (2 seqs * 1000 backbones = 2000).

You could write some shell script loops, get fancy with SLURM array jobs, write some Python/R/awk/Perl to bring it all together, and turn the commands above into a workflow.

Rather than cobble something together, we suggest you use nf-binder-design a Nextflow pipeline designed to do this. Nextflow is particularly well suited to running workflows like this, and automatically handles software installation (reproducibly across systems via containers), handles HPC queue submission and retries, and generally automates many manual steps as much as practical.

De novo binders against PDL1 with nf-binder-design

Here’s our example above, using the nf-binder-design RFdiffusion workflow.

First create a directory for our work and copy the PDL1 PDB file to it:

mkdir -p /scratch2/pd27/users/${USER}/exercises/rfd-nfbd/input
cd /scratch2/pd27/users/${USER}/exercises/rfd-nfbd
cp ../rfd/input/PDL1.pdb input/

# check your working directory
pwd

# see what files and directories you have so far
tree -L 2

Now we are ready to run the pipeline:

# Load the nextflow module that is pre-installed on the M3 HPC cluster
# Any recent version from the last ~12 months should work
module load nextflow/24.04.3 

# Nextflow automatically downloads containers for the software automatically.
# To save some time, the containers for the pipeline have been pre-cached for the workshop.
# We set these environment variables to point to the Apptainer cache location.
# Outside the workshop, this would be somewhere on your scratch space 
# (don't use your /home directory - it will go over quota !)
# You might add exports like this to your ~/.bashrc
export NXF_APPTAINER_CACHEDIR=/scratch2/pd27/shared/containers
export APPTAINER_CACHEDIR=$NXF_APPTAINER_CACHEDIR
unset NXF_SINGULARITY_CACHEDIR

nextflow run Australian-Protein-Design-Initiative/nf-binder-design \
  --input_pdb 'input/*.pdb' \
  --outdir results \
  --contigs "[A18-132/0 65-120]" \
  --hotspot_res "A56" \
  --rfd_n_designs=4 \
  --rfd_batch_size=1 \
  --pmpnn_seqs_per_struct=2 \
  --pmpnn_relax_cycles=1 \
  -profile local \
  -resume

#   --rfd_filters="rg<=20" \

(this is actually a slightly simiplified version of one of the examples in the nf-binder-design Github repository)

You can see all the options available with:

nextflow run Australian-Protein-Design-Initiative/nf-binder-design --help
Tip

As a general rule, parameters are named to match the underlying tool, with an rfd_, pmpnn_ or af2fig prefix. Extra parameters can be configured via --rfd_extra_args or a nextflow.config file:

// nextflow.config
process {
    withName: RFDIFFUSION {
        ext.args = 'potentials.guiding_potentials=[\"type:binder_ROG,weight:7,min_dist:10\"] potentials.guide_decay="quadratic"'
    }
}

Note the use of ' single quotes and escaped \" double quotes - this is important to ensure the string is passed correctly to the RFdiffusion command.